NFA macro + better NFA matcher #58

nitely · 2020-03-16T17:40:02Z

Changes:

Remove epsilon-transitions
Refactor regex module into many modules
Simplify most matching algorithms
Add benchmarks against Nim's re (most of the LoC changes come from the input text file)

Needs benchmarking to check for regressions.

closes #57
closes #28

nitely · 2020-03-24T23:41:22Z

Quick benchmark:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
somere.nim                                      relative  time/iter  iters/s
============================================================================
re_sol                                                       1.10ms   910.91
regex_sol                                         64.79%     1.69ms   590.20
re_nums                                                    171.08ns    5.85M
regex_nums2                                      88.46%   193.39ns    5.17M
re_nums                                                    176.39ns    5.67M
regex_nums                                       95.51%   184.69ns    5.41M

Nim's re VS nim-regex macro-NFA.

nitely · 2020-03-27T04:24:00Z

Almost done. I still need to implement the find version that takes O(N*M) time and benchmark it against the new O(N^2) version. It should be faster when N > M, assuming it does not add much overhead.
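The trade-off between the two find strategies can be sketched roughly as follows. This is a minimal Python illustration, not the PR's Nim code: the hand-built NFA table for `ab*c` and the names `match_at`, `find_restart`, and `find_single_pass` are all hypothetical, and N is assumed to be the text length and M the number of NFA states (the comment above does not define them).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical epsilon-free NFA for the pattern ab*c:
# state -> list of (expected character, next state)
TRANSITIONS: Dict[int, List[Tuple[str, int]]] = {
    0: [("a", 1)],
    1: [("b", 1), ("c", 2)],
    2: [],
}
START, ACCEPT = 0, 2


def match_at(text: str, start: int) -> Optional[int]:
    """Simulate the NFA from `start`; return the end of the shortest match, or None."""
    states = {START}
    for i in range(start, len(text)):
        states = {nxt for s in states for ch, nxt in TRANSITIONS[s] if ch == text[i]}
        if ACCEPT in states:
            return i + 1
        if not states:
            return None
    return None


def find_restart(text: str) -> Optional[Tuple[int, int]]:
    """Restart the matcher at every position: quadratic in the text length."""
    for start in range(len(text)):
        end = match_at(text, start)
        if end is not None:
            return (start, end)
    return None


def find_single_pass(text: str) -> Optional[Tuple[int, int]]:
    """Scan the text once, seeding a new attempt at each position.

    Each live state remembers the earliest start that reached it, so at most
    M states are tracked per character (roughly O(N*M) work overall).
    """
    threads: Dict[int, int] = {}  # state -> earliest start index
    best: Optional[Tuple[int, int]] = None
    for i, ch in enumerate(text):
        if best is None:
            threads.setdefault(START, i)  # seed a new attempt starting at i
        stepped: Dict[int, int] = {}
        for state, start in threads.items():
            for want, nxt in TRANSITIONS[state]:
                if want == ch and start < stepped.get(nxt, len(text)):
                    stepped[nxt] = start
        if ACCEPT in stepped and (best is None or stepped[ACCEPT] < best[0]):
            best = (stepped[ACCEPT], i + 1)
        if best is not None:
            # Only attempts that started further left can still beat the current best.
            stepped = {s: st for s, st in stepped.items() if st < best[0]}
            if not stepped:
                return best
        threads = stepped
    return best
```

Both functions return the leftmost match with its shortest end, so they can be checked against each other. The single-pass version pays for the extra bookkeeping per character, which is the overhead the comment above is worried about.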

nitely · 2020-04-04T00:08:24Z

benchmarks against nim-regex 0.13:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.87ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
regex_old_sol                                               42.60ms    23.48
regex_sol                                                    1.85ms   541.34
regex_old_nums                                               2.43us  410.78K
regex_nums                                                 206.64ns    4.84M
regex_old_nums2                                              2.92us  342.48K
regex_nums2                                                231.74ns    4.32M
regex_old_lits_find                                         41.71ms    23.97
regex_lits_find                                              1.03ms   968.61
email_old_find_all                                            1.27s  787.08m
email_find_all                                             472.56ms     2.12
dummy                                                        0.00fs      inf

timotheecour · 2020-04-04T02:23:34Z

this sounds amazing!

a few points:

1st benchmark shows re_nums instead of re_nums2; not sure whether both algos are comparing the same thing

re_nums                                                    171.08ns    5.85M
regex_nums2                                      88.46%   193.39ns    5.17M

maybe benchmark could add comparison against nre as well
-d:danger should be used for benchmarking; perhaps also --passc:-flto;
more generally, optimization decisions should be based on -d:danger builds, not -d:release; otherwise you may draw the wrong conclusions and end up with suboptimal results
maybe add nim 1.2 to CI now that it was released

> Remove epsilon-transitions

haven't looked at the code change in detail but does that come with any loss of functionality?

nitely · 2020-04-04T02:37:44Z

> 1st benchmark shows re_nums instead of re_nums2; not sure whether both algos are comparing the same thing

They are. I've fixed the name.

> maybe benchmark could add comparison against nre as well

IIRC, Araq said re is faster than nre, but I can add it, no problem.

> -d:danger should be used for benchmarking; perhaps also --passc:-flto;

I did set -d:danger. -d:release is only 10% slower, IIRC.

> maybe add nim 1.2 to CI now that it was released

Yeah, I'm using nim's docker. I'm sure it will be added in a few days.

> haven't looked at the code change in detail but does that come with any loss of functionality?

No, it's just an optimization that avoids recursion in the matching algorithm.
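To illustrate why removing epsilon-transitions need not lose functionality, here is a minimal Python sketch (not the PR's Nim code). It assumes a hypothetical NFA representation with separate character edges and epsilon edges, and folds each state's epsilon closure into its character edges ahead of time, so a matcher never has to chase epsilon edges (recursively or otherwise) while matching. The names `epsilon_closure` and `remove_epsilons` are made up for this example.

```python
from typing import Dict, List, Set, Tuple

CharEdges = Dict[int, List[Tuple[str, int]]]
EpsEdges = Dict[int, List[int]]


def epsilon_closure(state: int, eps_edges: EpsEdges) -> Set[int]:
    """All states reachable from `state` through epsilon edges only (iterative)."""
    seen = {state}
    stack = [state]
    while stack:
        for nxt in eps_edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def remove_epsilons(
    states: List[int],
    char_edges: CharEdges,
    eps_edges: EpsEdges,
    accepting: Set[int],
) -> Tuple[CharEdges, Set[int]]:
    """Build an equivalent NFA that has character edges only."""
    new_edges: CharEdges = {}
    new_accepting: Set[int] = set()
    for s in states:
        closure = epsilon_closure(s, eps_edges)
        edges: List[Tuple[str, int]] = []
        for q in sorted(closure):
            for edge in char_edges.get(q, []):
                if edge not in edges:
                    edges.append(edge)
        new_edges[s] = edges
        # A state accepts if it can reach an accepting state without consuming input.
        if closure & accepting:
            new_accepting.add(s)
    return new_edges, new_accepting


# Hypothetical NFA for a*b: 0 -eps-> 1, 0 -eps-> 2, 1 -a-> 0, 2 -b-> 3 (accepting)
edges, accepts = remove_epsilons(
    states=[0, 1, 2, 3],
    char_edges={1: [("a", 0)], 2: [("b", 3)]},
    eps_edges={0: [1, 2]},
    accepting={3},
)
```

After the transformation, state 0 carries the `a` and `b` edges directly, and the set of accepted strings is unchanged.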

thomastay · 2020-04-26T08:55:03Z

FWIW, I came across this PR when trying to benchmark mariomka for nim. Seems like most of the discussion has already been had in #57, but I wanted to just add in the benchmarks that I measured on my machine. Code is here in case I did something wrong; I am rather new to this repo.

Flags: -d:danger
Nim version: 1.3.1
nim-regex: 0.14.1

Benching with default GC, taking mean of 3 iterations
std/re: 28.0 - 92
std/re: 28.0 - 5301
std/re: 9.0 - 5
nim-regex: 371.0 - 92
nim-regex: 344.3333333333333 - 5301
nim-regex: 438.6666666666667 - 5

Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

nitely · 2020-04-26T09:04:29Z

find and findAll can be ~10x slower on some regexes due to the lack of a literal optimization (i.e., doing a quick memchr to locate potential matches before running the regex). I'm aware of it (see #59) and I'll optimize it as soon as I have some free time.
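The general idea of a literal optimization can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not what #59 implements: it assumes the pattern starts with a known, non-empty literal prefix, uses `str.find` as a stand-in for memchr, and the helper name `find_all_prefixed` is hypothetical.

```python
import re
from typing import List, Tuple


def find_all_prefixed(literal: str, rest: str, text: str) -> List[Tuple[int, int]]:
    """Find all matches of `literal` followed by the regex `rest`.

    A fast substring search jumps to each candidate position, and the
    regex engine only runs at positions where the literal actually occurs.
    """
    if not literal:
        raise ValueError("literal prefix must be non-empty")
    full = re.compile(re.escape(literal) + rest)
    spans: List[Tuple[int, int]] = []
    pos = 0
    while True:
        pos = text.find(literal, pos)  # cheap scan for the next candidate
        if pos == -1:
            break
        m = full.match(text, pos)      # verify the candidate with the regex
        if m:
            spans.append((m.start(), m.end()))
            pos = m.end()              # continue after the match
        else:
            pos += 1                   # false candidate, keep scanning
    return spans


sample = "foo12 bar foo foo7 xfoo345"
spans = find_all_prefixed("foo", r"\d+", sample)
```

On text where the literal is rare, most of the input is skipped by the substring search, which is where the speedup comes from.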

timotheecour · 2020-04-26T10:10:26Z

> Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

I'm not buying that argument.

At the end of the day, we should do what's practical and useful. Certainly nim-regex can have a pure nim implementation that keeps improving, and, all else being equal (eg performance), using a pure nim implementation is better, but nothing prevents it from benefiting from optimizations which could be controlled by some flags or library options.

This allows user code to stay identical (offering same interface) regardless of whether PCRE or other library is used (implementation detail).

Very concretely, the main point of a pure Nim implementation is to make it available on all backends/environments (e.g. the Nim VM, maybe Nim JS or NimScript, or cases where PCRE is not available); but where performance matters and PCRE is available and faster than the Nim engine, there is no practical reason not to benefit from it (other than implementing that bridge, which should be a lot easier than improving the engine itself).

+1 for providing a benchmark though

nitely · 2020-07-16T21:27:18Z

@thomastay @timotheecour good news, my regex literal optimization is 34x faster than re for the email regex, and about the same speed for the rest. It's not even the macro version; that one would be faster, but I don't think I'll implement it.

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
re_email_find_all                                           24.19ms    41.33
email_find_all                                  3389.08%   713.91us    1.40K
re_uri_find_all                                             24.75ms    40.40
uri_find_all                                      92.92%    26.64ms    37.54
re_ip_find_all                                               6.41ms   156.06
ip_find_all                                      101.27%     6.33ms   158.04
runes                                                        5.99ms   167.04
dummy                                                        0.00fs      inf
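For readers checking the "34x" figure: judging from the rows of the table, the relative column appears to be the baseline's time per iteration divided by the candidate's, as a percentage. A small Python sketch of that arithmetic (the helper name `relative` is made up here, and because the table shows rounded times, the results only approximately reproduce the printed column):

```python
def relative(baseline_seconds: float, candidate_seconds: float) -> float:
    """Relative speed as a percentage; above 100 means faster than the baseline."""
    return 100.0 * baseline_seconds / candidate_seconds


# Times per iteration taken from the table above.
email = relative(24.19e-3, 713.91e-6)  # re_email_find_all vs email_find_all
uri = relative(24.75e-3, 26.64e-3)     # re_uri_find_all vs uri_find_all
ip = relative(6.41e-3, 6.33e-3)        # re_ip_find_all vs ip_find_all
print(f"email {email:.0f}%  uri {uri:.1f}%  ip {ip:.1f}%")
```

The email row comes out at roughly 34 times the baseline speed, while the uri and ip rows stay within a few percent of 100.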

nitely changed the title ~~Revamp~~ NFA macro + better NFA matcher Mar 24, 2020

nitely added 26 commits April 3, 2020 21:09

wip

1bc1aa2

make match work

05538e7

find

ac50e5f

findAll

cb10266

split

9d67e79

ignore

29ba077

replace

e18b1dc

fix empty-match

f9c0e33

perf improvement

770128c

tests

5f6669c

Nim 0.19

ec4cca8

tests

c918374

drop nim 0.19.0 support

656bfc6

tests

546189b

tests

06b1f91

dynamic dfa

be9f29a

fast nfa

cd634a1

macro

2b40aee

perf

2b68259

perf

cce6ed7

genMatch

cbc5dbc

checks off

bf53574

gen eoe

a50a3d0

macro capture

62f32ad

nfatype

5216f65

finish macro

2fbcc6a

nitely added 10 commits April 3, 2020 21:11

remove pointles parens

426af5f

static api

409723a

findMatch

55dacff

shortMatch with findMatch

b32af49

api

c8d54fb

tests

2e20262

bench

af19e1e

bench

c8993e9

bench

579054d

bench

796bbf7

nitely force-pushed the revamp branch from 9a45dd3 to 796bbf7 Compare April 4, 2020 00:12

docs

feab91e

nitely merged commit 0570009 into master Apr 4, 2020

nitely deleted the revamp branch April 4, 2020 00:33

timotheecour mentioned this pull request Apr 4, 2020

misc tracking of unresolved issues/comments timotheecour/Nim#87

Open

timotheecour mentioned this pull request May 28, 2020

re.split unexpected results with zero-width characters nim-lang/Nim#14468

Closed

timotheecour mentioned this pull request May 4, 2021

re/nre: mention nim-regex as the alternative nim-lang/Nim#17926

Closed
